{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 验证和导入项目元数据 <a class=\"anchor\" id=\"top\"></a>\n",
    "\n",
    "在这个Jupyter笔记本中，您将从 `01_Validating_and_Importing_User_Item_Interaction_Data.ipynb` 中结束的地方开始来构建一个项目元数据集。这将允许您使用过滤器，以及 `User-Personalization` 或者 `HRNN-Metadata` 算法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "要运行当前这个笔记本，您需要运行之前的笔记本 `01_Validating_and_Importing_User_Item_Interaction_Data`，其中您创建了一个数据集然后将交互数据导入到 Amazon Personalize 中，并且在之前的笔记本最后保存了一些变量，现在您需要将这些值加载到这个笔记本中。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store -r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备项目元数据 <a class=\"anchor\" id=\"prepare\"></a>\n",
    "[回到顶部](#top)\n",
    "\n",
    "接下来要做的事情是加载数据并确认数据处于一个良好状态，然后将其保存到CSV中，以便 Amazon Personalize 使用。\n",
    "\n",
    "首先，导入数据科学常用的Python库集合。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "from time import sleep\n",
    "import json\n",
    "from datetime import datetime\n",
    "import boto3\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，打开数据文件并查看前几行数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data = pd.read_csv(dataset_dir + '/movies.csv')\n",
    "original_data.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这其实没有告诉我们关于数据集的更多信息，所以我们将探索更多的原始信息。可以看到类型 genres 通常被组合在一起放到列表中，这对我们来说很好，因为 Personalize 支持这种结构。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由此，您可以看到数据集中共有3列条目（完整数据集为62,000+，小数据集为9742）。\n",
    "\n",
    "这是一个非常小的数据集，只有电影ID movieId、电影名称 title 和适用于每个条目的类型列表 genres。然而，在Movielens数据集中还有额外的数据可用，例如，片名中包括了电影发行的年份。让我们再创建一列元数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "original_data['year'] =original_data['title'].str.extract('.*\\((.*)\\).*',expand = False)\n",
    "original_data.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从项目元数据的角度来看，我们只希望包含与训练模型和/或过滤结果相关的信息，因此我们将删除名称列，保留类型信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_df = original_data.copy()\n",
    "itemmetadata_df = itemmetadata_df[['movieId', 'genres', 'year']]\n",
    "itemmetadata_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在数据处理之后，务必要确认数据格式是否已更改。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Amazon Personalize有一个默认列 `ITEM_ID`，它将映射到我们的 `movieId`，现在可以通过指定 `GENRE` 来充实信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_df.rename(columns = {'genres':'GENRE', 'movieId':'ITEM_ID', 'year':'YEAR'}, inplace = True) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在数据已经准备好啦！我们只需要将其保存为CSV文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_filename = \"item-meta.csv\"\n",
    "itemmetadata_df.to_csv((data_dir+\"/\"+itemmetadata_filename), index=False, float_format='%.0f')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 配置 Boto3 的 Personalize:\n",
    "personalize = boto3.client('personalize')\n",
    "personalize_runtime = boto3.client('personalize-runtime')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 创建数据集\n",
    "\n",
    "首先，定义一个模式来告诉 Amazon Personalize 要上传的数据集类型。根据数据集的类型，模式中有几个保留和需要强制使用的关键字。更详细的信息可以在[文档](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html)中找到。\n",
    "\n",
    "在这里，您将为项目元数据创建一个Schema，它需要 `ITEM_ID` 和 `GENRE` 字段。它们在Schema中定义的顺序必须与它们在数据集中出现的顺序相同。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Items\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"ITEM_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"GENRE\",\n",
    "            \"type\": \"string\",\n",
    "            \"categorical\": True\n",
    "        },{\n",
    "            \"name\": \"YEAR\",\n",
    "            \"type\": \"int\",\n",
    "        },\n",
    "        \n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "\n",
    "create_schema_response = personalize.create_schema(\n",
    "    name = \"personalize-poc-movielens-item\",\n",
    "    schema = json.dumps(itemmetadata_schema)\n",
    ")\n",
    "\n",
    "itemmetadataschema_arn = create_schema_response['schemaArn']\n",
    "print(json.dumps(create_schema_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "创建Schema后，您可以在数据集组中创建数据集。注意，这时还没有加载数据，我们将在稍后几步操作。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_type = \"ITEMS\"\n",
    "create_dataset_response = personalize.create_dataset(\n",
    "    name = \"personalize-poc-movielens-items\",\n",
    "    datasetType = dataset_type,\n",
    "    datasetGroupArn = dataset_group_arn,\n",
    "    schemaArn = itemmetadataschema_arn\n",
    ")\n",
    "\n",
    "items_dataset_arn = create_dataset_response['datasetArn']\n",
    "print(json.dumps(create_dataset_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 上传数据到 S3\n",
    "\n",
    "现在已经创建了 Amazon S3 桶，接下来上传用户-项目交互数据的CSV文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "itemmetadata_file_path = data_dir + \"/\" + itemmetadata_filename\n",
    "boto3.Session().resource('s3').Bucket(bucket_name).Object(itemmetadata_filename).upload_file(itemmetadata_file_path)\n",
    "interactions_s3DataPath = \"s3://\"+bucket_name+\"/\"+itemmetadata_filename"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 导入项目元数据 <a class=\"anchor\" id=\"import\"></a>\n",
    "[回到顶部](#top)\n",
    "\n",
    "前面您创建了数据集组和数据集来存放信息，现在执行导入作业，将数据从S3桶加载到 Amazon Personalize 数据集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"personalize-poc-item-import1\",\n",
    "    datasetArn = items_dataset_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket_name, itemmetadata_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据集之前，导入作业必须是活动的。执行下面的单元格并等待它显示活动状态，它会每60秒钟检查一次导入作业的状态，最多等待6个小时。\n",
    "\n",
    "导入数据可能需要一些时间，这取决于数据集的大小。在这个实验中，数据导入作业大约需要15分钟。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "max_time = time.time() + 6*60*60 # 6 hours\n",
    "while time.time() < max_time:\n",
    "    describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n",
    "        datasetImportJobArn = dataset_import_job_arn\n",
    "    )\n",
    "    status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n",
    "    print(\"DatasetImportJob: {}\".format(status))\n",
    "    \n",
    "    if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n",
    "        break\n",
    "        \n",
    "    time.sleep(60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入完成后，您就可以启用推荐过滤功能以及 `HRNN-Metadata`。运行下面的单元格，然后继续存储一些值以便在下一个笔记本中使用。完成单元格后，打开笔记本`03_Creating_and_Evaluating_Solutions.ipynb`继续。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store items_dataset_arn\n",
    "%store itemmetadataschema_arn"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
